1,488 research outputs found

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of exactly how cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
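
The Benjamini-Yekutieli step mentioned in this abstract can be illustrated with a minimal sketch. This is the textbook step-up rule, not the CleanML authors' code: the function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are all assumptions made for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Textbook BY step-up rule (illustrative, not taken from CleanML):
    sort the m p-values, find the largest rank k such that
    p_(k) <= k * alpha / (m * c(m)), with c(m) = 1 + 1/2 + ... + 1/m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction that makes the bound valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            largest_k = rank  # step-up: keep the largest rank that passes
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values, e.g. one per (dataset, model, cleaning method) test.
decisions = benjamini_yekutieli([0.001, 0.02, 0.04, 0.5], alpha=0.05)
```

Because the rule is step-up, a p-value that fails its own threshold is still rejected when a larger-ranked p-value passes; this is why the loop records the largest passing rank instead of stopping at the first failure.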

    Lipschitz equivalence of self-similar sets and hyperbolic boundaries

    In [9], Kaimanovich introduced the concept of an augmented tree on the symbolic space of a self-similar set. It is hyperbolic in the sense of Gromov, and it was shown in [13] that under the open set condition, a self-similar set can be identified with the hyperbolic boundary of the tree. In this paper, we investigate in detail a class of simple augmented trees and the Lipschitz equivalence of such trees. The main purpose is to use this to study the Lipschitz equivalence problem for totally disconnected self-similar sets, which has seen extensive development recently. Comment: Advances in Mathematics, accepted (2012). 29 pages, 10 figures

    SAINE: Scientific Annotation and Inference Engine of Scientific Research

    We present SAINE, a Scientific Annotation and Inference ENgine based on a set of standard open-source software, such as Label Studio and MLflow. We show that our annotation engine can benefit the further development of a more accurate classification. Building on our previous work on hierarchical discipline classification, we demonstrate its application using SAINE in understanding the space of scholarly publications. The user study of our annotation results shows that user input collected with the help of our system can help us better understand the classification process. We believe that our work will help foster greater transparency and a better understanding of scientific research. Our annotation and inference engine can further support downstream meta-science projects. We welcome collaboration and feedback from the scientific community on these projects. The demonstration video can be accessed at https://youtu.be/yToO-G9YQK4. A live demo website is available at https://app.heartex.com/user/signup/?token=e2435a2f97449fa1 upon free registration. Comment: Under review in IJCNLP-AACL Demo 202

    Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning

    This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. This system enables a holistic categorization of research activities in the mentioned hierarchy in terms of knowledge production through articles and impact through citations, permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to address and allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we have conducted 3,140 experiments in all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy is > 90% in 77.13% and 78.19% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as a backbone to an interactive system for indexing scientific publications in the future. Comment: Under review in QS

    Modulator-Dependent RBPs Changes Alternative Splicing Outcomes in Kidney Cancer

    Alternative splicing alterations can contribute to human disease. The ability of an RNA-binding protein (RBP) to regulate alternative splicing outcomes can be modulated by a variety of genetic and epigenetic mechanisms. In this study, we use a computational framework to investigate the roles of certain genes, termed modulators, in changing RBPs' effects on splicing regulation. A total of 1,040,254 modulator-mediated RBP-splicing interactions were identified, including 137 RBPs, 4,309 splicing events and 2,905 modulator candidates from TCGA-KIRC RNA sequencing data. Modulator function categories were defined according to the correlation changes between RBP expression and their targets' splicing outcomes. QKI, as one of the RBPs influencing the most splicing events, attracted our attention in this study: 2,014 changing triplets were identified, including 1,101 modulators and 187 splicing events. Pathway enrichment analysis showed that QKI splicing targets were enriched in the tight junction, endocytosis and MAPK signaling pathways, all of which are highly associated with cancer development and progression. This is the first comprehensive study of how changes in alternative splicing outcomes are associated with different expression levels of certain proteins, even when the splicing events are regulated by the same RBP. Our work may provide a novel view on understanding alternative splicing mechanisms in kidney cancer.

    Experimental and theoretical study of the photoelectron spectra of MnOₓ⁻ (x = 1–3) clusters

    We report a combined experimental and theoretical investigation of MnOₓ⁻ and MnOₓ (x = 1–3) clusters. Theoretically, geometrical configurations of various isomers of the clusters were optimized and vertical detachment energies for the anions were evaluated. The ground state of MnO⁻ was predicted to be ⁵Σ⁺, followed by an excited state (⁷Σ⁺) 0.14 eV higher in energy. The ground state of MnO₂⁻ is ⁵B₂, with a ³B₁ isomer 0.15 eV higher. MnO₃⁻ is predicted to be a singlet D₃ₕ cluster. Vibrationally resolved photoelectron spectra of MnOₓ⁻ were measured at several photon energies and under various experimental conditions, and were interpreted based on the theoretical results. The electron affinities of MnO, MnO₂, and MnO₃ were determined to be 1.375 (0.010), 2.06 (0.03), and 3.335 (0.010) eV, respectively. Five excited states of MnO were observed and assigned using the theoretical results. The ⁷Σ⁺ excited state of MnO⁻ was found to be significantly populated and was distinguished from the ground state of the anion by temperature-dependent studies. We observed two isomers for MnO₂⁻, and the detachment features from both isomers were assigned. Only one vibrationally resolved band was observed for MnO₃⁻, which corresponds to transitions from the ground state of MnO₃⁻ to that of MnO₃. The combined experimental and theoretical studies allow us to elucidate the complicated electronic and geometric structures of the various manganese oxide clusters and their anions.

    High Glucose Alters Fetal Rat Islet Transcriptome and Induces Progeny Islet Dysfunction

    Offspring of diabetic mothers are susceptible to developing type 2 diabetes due to pancreatic islet dysfunction. However, the initiating molecular pathways leading to offspring pancreatic islet dysfunction are unknown. We hypothesized that maternal hyperglycemia alters the offspring pancreatic islet transcriptome and negatively impacts offspring islet function. We employed an infusion model capable of inducing localized hyperglycemia in fetal rats residing in the left uterine horn, thus avoiding other factors involved in programming offspring pancreatic islet health. While euglycemia was maintained in maternal dams and right uterine horn control fetuses, hyperglycemic fetuses in the left uterine horn had higher serum insulin and pancreatic beta cell area. Upon completing infusion from GD20 to 22, RNA sequencing was performed on GD22 islets to identify the hyperglycemia-induced altered gene expression. Ingenuity pathway analysis of the altered transcriptome found that diabetes mellitus and inflammation/cell death pathways were enriched. Interestingly, the downregulated genes modulate more diverse biological processes, which include responses to stimuli and developmental processes. Next, we performed ex vivo and in vivo studies to evaluate islet cell viability and insulin secretory function in weanling and adult offspring. Pancreatic islets of weanlings exposed to late gestation hyperglycemia had decreased cell viability in the basal state and decreased glucose-induced insulin secretion. Lastly, adult offspring exposed to in utero hyperglycemia also exhibited glucose intolerance and insulin secretory dysfunction. Together, our results demonstrate that late gestational hyperglycemia alters the fetal pancreatic islet transcriptome and increases offspring susceptibility to developing pancreatic islet dysfunction.